Lab 5: Evaluation and Multi-Layer Perceptron
Cameron Matson
Zihao Mao
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import os
In this lab we aim to predict the position of an NBA player given some set of descriptors and statistics of their play. The original data set contains the player information and stats from every player that played in the NBA from 1950 to present.
Professional sports teams in general, and basketball teams in particular, have turned in recent years to data to gain an edge over their opponents. Part of the responsibility of the team ownership and coaching staff is to assemble a team of players that gives them the best chance to win. Players on an NBA team have specific roles, largely based on the position that they play. It is, therefore, important that players play in the position that most helps their team win. However, in today's NBA, positions are much more fluid, and players can play positions the classical basketball system wouldn't expect (the Golden State Warriors of late, for example). We would like to create a classifier that would help NBA teams decide which position a player should play, which may not necessarily be the position they have actually played. This could especially be the case for players coming from college, where teams generally run systems quite different from those employed in the NBA, or for teams looking to make a change but unable to trade for a new player; they might instead be able to reposition one of their current players.
What our classifier will do, then, is look at the stats and player details for each position and, given a new set of stats and a player description, make a probabilistic estimate of which position that player plays. If the result diverges from the player's listed position, this might be an indication that the player is not playing the correct position.
With respect to accuracy, it's important to keep in mind that our classifier is meant to aid basketball personnel decisions, not make them authoritatively. Its primary use is as a tool to compare a player to the performance and description of players in the past, and report on the similarities the unknown player has to the different positions historically. So while we'd like to see as close to perfect prediction as possible on splits of our dataset, what will ultimately be most useful to us and to NBA teams is a reported probability for a player's given position that is significantly higher than those of the other possible positions.
# first let's load the datasets in
data_path = '../data/basketball'
players = pd.read_csv(os.path.join(data_path, 'players.csv'))
players.head()
All we want from this is the player's name and their height (cm) and weight (kg).
players.drop(['Unnamed: 0', 'collage', 'birth_city', 'birth_state', 'born'], axis=1, inplace=True)
players.info()
Good. They're all non null, and seem to be the correct datatype.
Now let's load the players stats.
stats = pd.read_csv(os.path.join(data_path, 'seasons_stats.csv'))
stats.info()
There are a lot of fields here, and they're pretty inconsistently filled. Some of this arises from the fact that it's such a long timeline. For example, in 1950 there was no such thing as a 3-pointer, so it wouldn't make sense for those players to have 3PT% stats.
Inspecting the dataset a little further, we notice that there is no stat for points per game (PPG). The total number of points scored is listed, but that is hard to compare across seasons in which players played different numbers of games. To make the dataset more valid, i.e. to make the points column a valid comparison measure, we'll only consider seasons with the current full 82-game schedule. This doesn't reduce the power of the dataset by much: the league moved to an 82-game season in 1967, and only the lockout-shortened 1998-99 season since then lacked a full schedule.
Actually, we might want to limit it to just seasons after 1980, when the 3-pointer was introduced. That should make the prediction task easier, although we lose even more of the dataset. But if we frame the business case as deciding players' positions TODAY, it makes sense.
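As a sketch of the per-game normalization discussed above, here is a toy example (hypothetical numbers, column names following the seasons_stats.csv schema) of converting season totals into per-game rates by dividing by games played:

```python
import pandas as pd

# toy frame mimicking the seasons_stats schema (hypothetical numbers)
stats_demo = pd.DataFrame({'PTS': [1640, 820], 'TRB': [410, 328], 'G': [82, 82]})

# divide each counting stat by games played to get a per-game rate
for col in ['PTS', 'TRB']:
    stats_demo[col + '_PG'] = stats_demo[col] / stats_demo['G']
print(stats_demo)
```

Restricting to full 82-game seasons, as we do below, makes even the raw totals comparable, but the same division would let us keep shortened seasons as well.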
stats = stats[stats.Year >= 1980]
stats = stats[stats.Year != 1998]
stats.info()
Now, to start, let's just focus on a few categories.
And of course our label: position. We could probably use any of the features as a label, actually, and see if one could predict performance in one aspect of the game based on information from another. But for now we'll stick with predicting position.
stats_to_keep = {'Player', 'Pos', 'G', 'MP', 'FG', 'FGA', 'FT', 'FTA',
                 '2P', '2PA', '3P', '3PA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK',
                 'TOV', 'PF', 'PTS'}
stats_to_drop = set(stats.columns)-stats_to_keep
stats.drop(stats_to_drop, axis=1, inplace=True)
stats.info()
Okay. Finally, let's add the player description data to the stats dataframe.
stats['height'] = np.nan
stats['weight'] = np.nan
iplayer = players.set_index(keys='Player')
istats = stats.reset_index(drop=True)
for i, row in istats.iterrows():
    name = row['Player']
    h = iplayer.loc[name].loc['height']
    w = iplayer.loc[name].loc['weight']
    # height and weight show up in the last two columns
    istats.iloc[i, len(istats.columns) - 2] = h
    istats.iloc[i, len(istats.columns) - 1] = w
stats = istats
# and now we don't need the names anymore
players = stats.Player
stats.drop(['Player'], axis=1, inplace=True)
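As an aside, the row-by-row lookup above could likely be replaced with a single pandas merge, which is much faster. A minimal sketch on toy frames (hypothetical names and values standing in for `stats` and `players`):

```python
import pandas as pd

# toy frames standing in for `stats` and `players` (hypothetical rows)
stats_demo = pd.DataFrame({'Player': ['A. Guard', 'B. Center'], 'PTS': [1200, 900]})
players_demo = pd.DataFrame({'Player': ['A. Guard', 'B. Center'],
                             'height': [188.0, 213.0], 'weight': [86.0, 113.0]})

# a single left merge attaches height/weight without the row-by-row loop
merged = stats_demo.merge(players_demo, on='Player', how='left')
print(merged)
```

One caveat: duplicate player names would produce duplicated rows under a merge, though the same ambiguity affects the `.loc[name]` lookup in the loop above.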
Finally, let's convert the position from a string to a number 1-5, and divide the dataset into data and labels (X and y). There are technically more than 5 positions listed (some players have multiple positions listed), but we'll just go off of the first-listed, primary position.
# first we need to separate out the label from the data
y = stats.Pos
set(y)
Let's reclassify this numerically and only count each player's 'primary' position, so that each player will be given a position 1-5: {(1, PG), (2, SG), (3, SF), (4, PF), (5, C)}.
def convert_pos(y):
    newy = np.zeros((len(y), 1))
    for i, player in enumerate(y):
        if player[0] == 'C':
            newy[i] = 5
        elif player[0:2] == 'PF':
            newy[i] = 4
        elif player[0:2] == 'SF':
            newy[i] = 3
        elif player[0:2] == 'SG':
            newy[i] = 2
        elif player[0:2] == 'PG':
            newy[i] = 1
    return newy
y = convert_pos(y)
y
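The same primary-position encoding can be written as a vectorized pandas one-liner, which may be handy for larger label sets. A sketch on a hypothetical sample of listed positions (including a dual position like 'SG-SF'):

```python
import pandas as pd

# hypothetical sample of listed positions, including a dual position
pos = pd.Series(['PG', 'SG-SF', 'C', 'PF'])

# keep only the primary (first-listed) position, then map it to 1-5
mapping = {'PG': 1, 'SG': 2, 'SF': 3, 'PF': 4, 'C': 5}
primary = pos.str.split('-').str[0]
codes = primary.map(mapping)
print(codes.tolist())  # [1, 2, 5, 4]
```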
# let's update it in the dataframe just for fun
stats['Pos'] = y
y = y.ravel()
from sklearn import preprocessing
# now we can drop the label, and we'll scale the data to zero mean and unit variance
X = stats.drop(['Pos'], axis=1)
ss = preprocessing.StandardScaler()
X = ss.fit_transform(X)
X
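As a quick sanity check on what `StandardScaler` does, here is a minimal sketch on toy data (hypothetical height/weight-like values): after fitting and transforming, each column should have approximately zero mean and unit variance.

```python
import numpy as np
from sklearn import preprocessing

# toy data standing in for the stats matrix (hypothetical values)
X_demo = np.array([[180.0, 80.0], [200.0, 110.0], [190.0, 95.0]])
ss = preprocessing.StandardScaler()
X_scaled = ss.fit_transform(X_demo)

# each column should now have (approximately) zero mean and unit variance
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```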
Before we go any further, let's see what some of the distributions are.
# first, how many of each class
np.bincount(y.astype(np.int32))
import plotly
plotly.offline.init_notebook_mode() # run at the start of every notebook
graph1 = {'labels': ['PG', 'SG', 'SF', 'PF', 'C'],
          'values': np.bincount(y.astype(np.int32) - 1),  # -1 because bincount starts at 0
          'type': 'pie'}
fig = dict()
fig['data'] = [graph1]
fig['layout'] = {'title': 'Total Class Distribution',
                 'autosize': False,
                 'width': 500,
                 'height': 400}
plotly.offline.iplot(fig)
Of the five positions, each makes up about 20% of the dataset. This means that when we do our cross-validation, we don't need to worry about doing stratified splits.
Accuracy, precision, recall, and the F1-measure could all be used to evaluate our method. For the dataset we have chosen, we decided to use accuracy and the F-measure as the evaluation metrics. Accuracy is the most common metric, even though it doesn't consider the cost of misclassifications, which can make accuracy scores mostly pointless on skewed data. We still think it's an appropriate metric for us, since the class distribution in our dataset is very even: each of the 5 classes covers roughly 20% of the overall dataset.
The F1-measure combines precision and recall. It helps address accuracy's blindness to misclassification costs, and overall improves our evaluation of the algorithm.
In order to train and tune the hyperparameters properly, it's helpful to split the original dataset into a training set, a test set, and a validation set. Ours is a large dataset with well-distributed classes, so we use 20% of the whole dataset as the test set, 80% as the training set, and 20% of the training set as the validation set.
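The split described above can be sketched as two chained calls to `train_test_split` (toy data standing in for our X and y; the `random_state` value is an arbitrary choice for repeatability):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data standing in for (X, y)
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50) % 5

# 20% held out for test, then 20% of the remaining train portion for validation
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_tr, y_tr, test_size=0.2, random_state=0)
print(len(X_tr), len(X_val), len(X_te))  # 32 8 10
```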
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
Let's see if we can do this. Below is our implementation of the MLP, largely based on the code from class, but with a few changes and additions. It's massive.
# Example adapted from https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb
# Original Author: Sebastian Raschka
# This is the optional book we use in the course, excellent intuitions and straightforward programming examples
# please note, however, that this code has been manipulated to reflect our assumptions and notation.
from scipy.special import expit
from sklearn.metrics import accuracy_score, f1_score
import warnings
import sys

class MultiLayerPerceptron(object):
    def __init__(self, n_hidden_neurons=(30,),
                 C=0.0, epochs=500, eta=0.001, random_state=None,
                 activation='sigmoid', cost_function='mse',
                 tol=1e-6):
        np.random.seed(random_state)
        self.n_hidden_layers = len(n_hidden_neurons)
        self.n_hidden_neurons = n_hidden_neurons
        self.C = C
        self.epochs = epochs
        self.eta = eta
        self.activation = activation
        self.cost_function = cost_function
        self.tol = tol
        print('layers:', self.n_hidden_layers, 'neurons per layer:', self.n_hidden_neurons)
    @staticmethod
    def _encode_labels(y):
        """Encode labels into one-hot representation"""
        onehot = pd.get_dummies(y).values
        return onehot

    def _initialize_weights(self):
        """Initialize weights with small random numbers."""
        # there are n_layers W matrices
        W_matrices = []
        if self.n_hidden_layers == 0:
            print('no hidden neurons')
            W_num_elems = (self.n_features_ + 1) * self.n_output_
            W = np.random.uniform(-1.0, 1.0, size=W_num_elems)
            W = W.reshape(self.n_features_ + 1, self.n_output_)
            W_matrices.append(W)
            return W_matrices
        # input layer
        W1_num_elems = (self.n_features_ + 1) * self.n_hidden_neurons[0]
        W1 = np.random.uniform(-1.0, 1.0, size=W1_num_elems)
        W1 = W1.reshape(self.n_features_ + 1, self.n_hidden_neurons[0])  # (N+1 x S_0)
        W_matrices.append(W1)
        # hidden layers
        for S in range(1, self.n_hidden_layers):
            W_num_elems = (self.n_hidden_neurons[S-1] + 1) * self.n_hidden_neurons[S]
            W = np.random.uniform(-1.0, 1.0, size=W_num_elems)
            W = W.reshape(self.n_hidden_neurons[S-1] + 1, self.n_hidden_neurons[S])
            W_matrices.append(W)
        # final layer
        Wf_num_elems = (self.n_hidden_neurons[-1] + 1) * self.n_output_
        Wf = np.random.uniform(-1.0, 1.0, size=Wf_num_elems)
        Wf = Wf.reshape(self.n_hidden_neurons[-1] + 1, self.n_output_)
        W_matrices.append(Wf)
        return W_matrices
    @staticmethod
    def _sigmoid(z):
        """Use scipy.special.expit to avoid overflow"""
        # equivalent to 1.0 / (1.0 + np.exp(-z))
        return expit(z)

    @staticmethod
    def _add_bias_unit(X_train, how='column'):
        """Add bias unit (column or row of 1s) to array at index 0"""
        if how == 'column':
            ones = np.ones((X_train.shape[0], 1))
            X_new = np.hstack((ones, X_train))
        elif how == 'row':
            ones = np.ones((1, X_train.shape[1]))
            X_new = np.vstack((ones, X_train))
        return X_new

    @staticmethod
    def _L2_reg(lambda_, W_matrices):
        """Compute L2-regularization cost"""
        # only compute for non-bias terms
        s = 0
        for W in W_matrices:
            s += np.mean(W[1:, :]**2)
        return (lambda_ / 2.0) * np.sqrt(s)

    def _cost(self, A_final, Y_enc, W_matrices):
        """Get the objective function value"""
        cost = np.mean((Y_enc - A_final)**2)
        L2_term = self._L2_reg(self.C, W_matrices)
        return cost + L2_term
    def _feedforward(self, X_train, W_matrices):
        """Compute feedforward step"""
        A_matrices = []
        Z_matrices = []
        # A1 is just the bias-added data matrix X (M x N+1)
        A1 = self._add_bias_unit(X_train, how='column')
        A_matrices.append(A1)
        for S, W in enumerate(W_matrices):
            Z = A_matrices[S] @ W  # ((M x N+1) x (N+1 x S_i)) = (M x S_i)
            Z_matrices.append(Z)
            # sigmoid or linear activation
            if self.activation == 'sigmoid':
                A = self._sigmoid(Z)
            else:
                A = Z
            A = self._add_bias_unit(A, how='column')  # (M x S_i+1)
            A_matrices.append(A)
        # remove the bias from the last A
        A_matrices[-1] = A_matrices[-1][:, 1:]
        return A_matrices, Z_matrices
    def _get_gradient(self, A_matrices, Z_matrices, Y_enc, W_matrices):
        """Compute gradient step using backpropagation."""
        sigma_matrices = []
        grad_matrices = []
        # different cost functions
        if self.cost_function == 'mse':
            sigma_final = -2 * (Y_enc - A_matrices[-1])
            if self.activation == 'sigmoid':
                sigma_final *= A_matrices[-1] * (1 - A_matrices[-1])
        elif self.cost_function == 'cross':
            sigma_final = A_matrices[-1] - Y_enc
        sigma_matrices.append(sigma_final)
        # based on the penultimate A
        grad_final = sigma_final.T @ A_matrices[-2]
        grad_final[:, 1:] += (W_matrices[-1]).T[:, 1:] * self.C
        grad_matrices.append(grad_final)
        # move backwards starting with the second-to-last layer
        for i in range(1, len(W_matrices)):
            A2 = A_matrices[-(1+i)]
            A1 = A_matrices[-(2+i)]
            W2 = W_matrices[-i]
            W1 = W_matrices[-(1+i)]
            sigma = sigma_matrices[i-1] @ W2.T
            if self.activation == 'sigmoid':
                sigma *= A2 * (1 - A2)
            # if the activation is linear we're done
            sigma = sigma[:, 1:]  # remove bias column
            sigma_matrices.append(sigma)
            grad = sigma.T @ A1
            grad[:, 1:] += (W1.T)[:, 1:] * self.C
            grad_matrices.append(grad)
        # flip them back into forward order
        grad_matrices.reverse()
        return grad_matrices
    def predict(self, X_data):
        """Predict class labels"""
        A, _ = self._feedforward(X_data, self.W_matrices)
        A_final = A[-1]
        y_pred = np.argmax(A_final, axis=1) + 1  # +1 because positions are 1-5
        return y_pred

    # NOTE: the three scoring helpers below read the notebook-level
    # X_train/y_train globals rather than taking X and y as arguments
    def score(self, *args):
        yhat = self.predict(X_train)
        return accuracy_score(y_train, yhat)

    def f_score(self, *args):
        yhat = self.predict(X_train)
        return f1_score(y_train, yhat, average='micro')

    def conf_matrix(self, *args):
        yhat = self.predict(X_train)
        return confusion_matrix(y_train, yhat)
    # lifted straight out of the sklearn source code with some modifications
    @classmethod
    def _get_param_names(cls):
        # this is just specific to this classifier
        return sorted(['eta', 'epochs', 'C', 'n_hidden_neurons'])

    def get_params(self, deep=True):
        """Get parameters for this estimator.

        Parameters
        ----------
        deep : boolean, optional
            If True, will return the parameters for this estimator and
            contained subobjects that are estimators.

        Returns
        -------
        params : mapping of string to any
            Parameter names mapped to their values.
        """
        out = dict()
        for key in self._get_param_names():
            # We need deprecation warnings to always be on in order to
            # catch deprecated param values.
            warnings.simplefilter("always", DeprecationWarning)
            try:
                with warnings.catch_warnings(record=True) as w:
                    value = getattr(self, key, None)
                if len(w) and w[0].category == DeprecationWarning:
                    # if the parameter is deprecated, don't show it
                    continue
            finally:
                warnings.filters.pop(0)
            if deep and hasattr(value, 'get_params'):
                deep_items = value.get_params().items()
                out.update((key + '__' + k, val) for k, val in deep_items)
            out[key] = value
        return out
    def set_params(self, **params):
        """Set the parameters of this estimator.

        The method works on simple estimators as well as on nested objects
        (such as pipelines). The latter have parameters of the form
        ``<component>__<parameter>`` so that it's possible to update each
        component of a nested object.

        Returns
        -------
        self
        """
        if not params:
            # Simple optimization to gain speed (inspect is slow)
            return self
        valid_params = self.get_params(deep=True)
        for key, value in params.items():
            split = key.split('__', 1)
            if len(split) > 1:
                # nested objects case
                name, sub_name = split
                if name not in valid_params:
                    raise ValueError('Invalid parameter %s for estimator %s. '
                                     'Check the list of available parameters '
                                     'with `estimator.get_params().keys()`.' %
                                     (name, self))
                sub_object = valid_params[name]
                sub_object.set_params(**{sub_name: value})
            else:
                # simple objects case
                if key not in valid_params:
                    raise ValueError('Invalid parameter %s for estimator %s. '
                                     'Check the list of available parameters '
                                     'with `estimator.get_params().keys()`.' %
                                     (key, self.__class__.__name__))
                setattr(self, key, value)
        return self
    def fit(self, X, y, print_progress=False):
        """Learn weights from training data."""
        X_data, y_data = X.copy(), y.copy()
        Y_enc = self._encode_labels(y)
        # init weights and setup matrices
        self.n_features_ = X_data.shape[1]
        self.n_output_ = Y_enc.shape[1]
        self.W_matrices = self._initialize_weights()
        self.cost_ = []
        self.score_ = []
        for i in range(self.epochs):
            if print_progress > 0 and (i+1) % print_progress == 0:
                sys.stderr.write('\rEpoch: %d/%d' % (i+1, self.epochs))
                sys.stderr.flush()
            # feedforward all instances
            A_matrices, Z_matrices = self._feedforward(X_data, self.W_matrices)
            # backpropagation
            grads = self._get_gradient(A_matrices, Z_matrices, Y_enc, self.W_matrices)
            for j, grad in enumerate(grads):
                self.W_matrices[j] -= self.eta * grad.T
            cost = self._cost(A_matrices[-1], Y_enc, self.W_matrices)
            self.cost_.append(cost)
            self.score_.append(self.score())
            # early stopping once the cost stops improving
            if i > 1 and abs(self.cost_[-1] - self.cost_[-2]) < self.tol:
                break
        self.epochs_actual_ = i + 1
        return self
Let's check to see how it does.
nn = MultiLayerPerceptron()
%%time
nn.fit(X_train, y_train, print_progress=1)
yhat = nn.predict(X_train)
print('Train acc: ', nn.score())
plt.plot(range(len(nn.cost_)), nn.cost_)
plt.plot(range(len(nn.score_)), nn.score_)
plt.legend(['cost', 'score'])
plt.ylabel('Cost/Score')
plt.xlabel('Epochs')
plt.tight_layout()
plt.show()
from sklearn.model_selection import GridSearchCV
import warnings
parameters = {'epochs': [50, 100, 300],
              'eta': np.logspace(base=10, start=-5, stop=1, num=4),
              'C': np.logspace(base=10, start=-3, stop=1, num=4)}
nncv = MultiLayerPerceptron(cost_function='mse', tol=1e-3)
clf = GridSearchCV(nncv, parameters, verbose=3)
clf.fit(X_train, y_train)
# the grid-search results were cached to a CSV; uncomment the two lines below to regenerate it
cvres = pd.read_csv(os.path.join(data_path, 'hyper_param_gs.csv'))
#cvres = pd.DataFrame(clf.cv_results_)
#cvres.to_csv(os.path.join(data_path, 'hyper_param_gs.csv'))
cvres
Let's look at what the best performing parameters were.
clf.best_score_
clf.best_params_
nnn = clf.best_estimator_
plt.plot(range(len(nnn.cost_)), nnn.cost_)
plt.plot(range(len(nnn.score_)), nnn.score_)
plt.ylabel('Cost/Score')
plt.legend(['cost', 'score'])
plt.xlabel('Epochs')
plt.tight_layout()
plt.show()
Let's see how the score varies over these different parameters.
batch300 = cvres[cvres.param_epochs == 300]
sns.pointplot(batch300.param_C.values, batch300.mean_test_score.values, hue=batch300.param_eta)
plt.title('Score vs. Regularization Coefficient C')
plt.xlabel('C')
plt.show()
Most of the time the regularization coefficient has little effect on the score, except in the case of the best eta, where it had mixed results; in general, though, larger values of C performed better.
batch300 = cvres[cvres.param_epochs == 300]
sns.pointplot(batch300.param_eta.values, batch300.mean_test_score.values, hue=batch300.param_C)
plt.title('Score vs. Learning Rate eta')
plt.xlabel('eta')
plt.show()
Here we see that there is a sweet spot for the value of eta. As eta increases, the score initially increases and then drops off steeply. Once eta gets too large, the performance plateaus regardless of the regularization.
from sklearn.neural_network import MLPClassifier
sknn = MLPClassifier(hidden_layer_sizes=(30,),
                     activation='logistic',
                     max_iter=300,
                     learning_rate_init=0.001,
                     alpha=10)
%time sknn.fit(X_train, y_train)
yhat = sknn.predict(X_train)
print('Train acc:', accuracy_score(y_train, yhat))
nn = MultiLayerPerceptron(cost_function='mse',
                          C=10.0, epochs=300, eta=0.001,
                          tol=1e-3)
nn.fit(X_train, y_train, print_progress=1)
conf_matrix = nn.conf_matrix()
import matplotlib.pylab as pl
pl.figure()
tb = pl.table(cellText=conf_matrix, loc=(0, 0), cellLoc='center')
tc = tb.properties()['child_artists']
for cell in tc:
    cell.set_height(1/5)
    cell.set_width(1/5)
ax = pl.gca()
ax.set_xticks([])
ax.set_yticks([])
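Since seaborn is already imported, a heatmap may be an easier way to read the confusion matrix than a raw table. A minimal sketch on a hypothetical 5x5 matrix standing in for `nn.conf_matrix()` (the counts below are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# hypothetical confusion matrix standing in for nn.conf_matrix()
cm = np.array([[50, 5, 1, 0, 0],
               [6, 40, 7, 1, 0],
               [1, 8, 35, 9, 2],
               [0, 1, 10, 38, 6],
               [0, 0, 2, 7, 45]])
labels = ['PG', 'SG', 'SF', 'PF', 'C']
sns.heatmap(cm, annot=True, fmt='d', xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
```

The diagonal shows correct predictions; off-diagonal mass between adjacent positions (e.g. SF vs. PF) would be consistent with the fluid-position story above.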
To better understand the performance of the multi-layer perceptron, we would like to see how it does with more than one hidden layer. In the following code we experiment with up to 3 hidden layers. Due to time constraints and the computational limits of our hardware, we could only take the number of neurons and layers as our variables, keeping the other hyperparameters constant; the best-performing single-hidden-layer parameters found earlier are applied here. For the same reason, we chose 20, 30, and 50 as reasonable numbers of neurons, and test every permutation of these widths for up to 3 layers.
num_neu_list = []
acc_list = []
f1_list = []

import itertools
num_neu = [20, 30, 50]

# try every permutation of layer widths for 1, 2, and 3 hidden layers
for n_layers in (1, 2, 3):
    for element in itertools.product(num_neu, repeat=n_layers):
        nn = MultiLayerPerceptron(cost_function='mse',
                                  n_hidden_neurons=element,
                                  C=10.0, epochs=300, eta=0.001,
                                  tol=1e-3)
        nn.fit(X_train, y_train, print_progress=1)
        print('Train acc: ', nn.score())
        print('f1 score:', nn.f_score())
        num_neu_list.append(element)
        acc_list.append(nn.score())
        f1_list.append(nn.f_score())
As we can see above, the accuracy and micro-averaged F1 scores we got are identical in every case. This turns out to be expected rather than a bug: in a single-label multiclass setting, every misclassification counts simultaneously as one false positive and one false negative, so micro-averaged precision, recall, and F1 all reduce to the overall fraction of correct predictions, i.e. the accuracy. The scikit-learn documentation notes the same: "micro"-averaging in a multiclass setting with all labels included produces equal precision, recall, and F. In the graph below, we take the F1 score as the main metric to evaluate the performance of each network.
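As a small check of this behavior, here is a toy multiclass example (hypothetical labels) showing micro-averaged F1 coinciding with accuracy:

```python
from sklearn.metrics import accuracy_score, f1_score

# small multiclass example: with exactly one label per instance,
# micro-averaged precision, recall, and F1 all collapse to accuracy
y_true = [1, 2, 3, 4, 5, 1, 2, 3]
y_pred = [1, 2, 3, 3, 5, 2, 2, 3]
print(accuracy_score(y_true, y_pred))             # 0.75
print(f1_score(y_true, y_pred, average='micro'))  # 0.75
```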
MLP_scores = pd.DataFrame()
MLP_scores['Num_neu'] = num_neu_list
MLP_scores['Accuracy'] = acc_list
MLP_scores["F1"] = f1_list
MLP_scores.head()
import plotly
import plotly.graph_objs as go
plotly.offline.init_notebook_mode()
MLP_scores.Num_neu = MLP_scores.Num_neu.astype(str)
MLP_scores.dtypes
data = [go.Bar(x=MLP_scores['Num_neu'],
               y=MLP_scores['F1'])]
layout = go.Layout(title='F1 Scores of Multi-Layer Networks')
fig = go.Figure(data = data, layout = layout)
plotly.offline.iplot(fig, filename='basic-bar')
According to the graph above, the network with a single hidden layer of 20 neurons performs best, reaching an F1 score above 0.64. The second best is the network with 2 hidden layers of (50, 20) neurons and a score of 0.5735, slightly better than (50, 30) at 0.5718.